For this investigation, game data from the AFL, the professional league for Australian Rules Football, will be interrogated. The league is over 100 years old. Australian Rules is a team game with 22 players a side (18 on the field at any one time), in which each team aims to outscore the opposition: six points are awarded for a goal, or one point if the kick misses the goal and scores a ‘behind’ (yes, a sport where you get points for missing!). For more information, I encourage you to check the usual suspects, Wikipedia and YouTube.
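As a small illustration of the scoring rule, the sketch below computes a team total from its goals and behinds. The function name `afl_score` is made up for this example and is not part of the analysis code.

```r
# Hypothetical helper: total points from goals (6 each) and behinds (1 each).
afl_score <- function(goals, behinds) {
  6 * goals + behinds
}

# A side kicking 15 goals and 10 behinds scores 15 * 6 + 10 = 100 points.
afl_score(15, 10)
```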

The data for this analysis has been scraped from the AFL website using a Python script contained in this same repo. Appropriate queries return JSON documents, which have been lightly modified and formatted into CSV files of tidy data using pandas. The data covers all matches from 2001 to 2015, although not all statistics are recorded for the earlier seasons.

There are three data sets that have been collected from the website:
- Match Data (md), a list of every match played, with key results about the match.
- Player Match Stats (pms), a record of each player’s performance in each match, with vital statistics including kicks, handballs, tackles, marks, hitouts, goals and behinds.
- Player Summary Data (psum), key biographical data for each player, including height, weight, date of birth, draft year and age. This dataset is incomplete for some variables.

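library(plyr)       # revalue(); load before dplyr so dplyr's verbs are not masked
library(dplyr)      # select(), group_by(), summarise(), %>%
library(ggplot2)    # ggplot(), qplot()
library(gridExtra)  # grid.arrange()

pms$gameFloat <- with(pms, year + (round-1)/30) # Add a new value to plot games over time
psum$dateOfBirth <- as.POSIXlt(psum$dateOfBirth, format="%Y-%m-%d %H:%M:%S") # Convert dates to date objects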
# Add in a column to the player summary indicating how many games are in the data for that player.
# Note that this will be different to total games played for players who debuted prior to 2001.
psum <- group_by(pms, playerId) %>%
  summarise(games = n()) %>%
  ungroup() %>%
  merge(psum)
# Collapse the detailed position labels into a smaller set of broad roles
psum$position <- revalue(psum$position, c(
  "Full Back"="Back", "Left Back Pocket"="Back", "Right Back Pocket"="Back",
  "Centre Half Back"="Half Back", "Left Half Back"="Half Back", "Right Half Back"="Half Back",
  "Left Wing"="Centre", "Right Wing"="Centre", "Midfielder"="Centre",
  "Centre Half Forward"="Half Forward", "Left Half Forward"="Half Forward", "Right Half Forward"="Half Forward",
  "Full Forward"="Forward", "Left Forward Pocket"="Forward", "Right Forward Pocket"="Forward",
  "Ruck Rover"="Rover",
  "Interchange"="Other", "Substitute"="Other", "Vice-Captain"="Other"))

We’re also going to construct one extra data frame: the mean of each player’s game statistics across the period, combined with selected fields from the player summary (height, weight, games and position). A subset of this dataset, including only players with more than 20 games, has also been built.

agg <- select(pms, playerId, behinds, bounces, clangers, clearances_totalClearances,
               contestedMarks, contestedPossessions, disposals, dreamTeamPoints, hitouts,
               freesFor, freesAgainst, goals, inside50s, handballs, kicks, marks, marksInside50, tackles) %>%
  group_by(playerId) %>%
  summarise_each(funs(mean)) %>%
  ungroup() %>%
  merge(select(psum, playerId, heightInCm, weightInKg, games, position)) %>%
  subset((!is.na(heightInCm))&(heightInCm>1)&(!is.na(weightInKg))&(weightInKg>1))
 
# Make a subset for players with more than 20 games
agg20 <- subset(agg, games>20)

Data Quality

Looking at whether the data is well formed, we can investigate the number of home games by team. We observe low counts for two teams, GCFC and GWS, which only joined the competition in 2011 and 2012 respectively. There also appear to be some games with missing labels. On closer inspection, these 49 games are missing all details, and are scattered irregularly across the period 2001 to 2011.

homegames_by_team <- group_by(md, homeTeam) %>%
  summarise(games = n()) %>%
  arrange(-games)
ggplot(aes(x=homeTeam, y=games), data=homegames_by_team) + geom_bar(stat='identity')
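One way to count the unlabelled matches mentioned above is sketched here on a toy data frame. The values are made up, and whether blanks are stored as `NA` or as empty strings in `md` is an assumption.

```r
# Toy stand-in for md; the real data would come from the scraped CSV.
md_toy <- data.frame(
  year     = c(2001, 2001, 2005, 2011, 2015),
  homeTeam = c("COLL", "", "GEEL", NA, "SYD"),
  stringsAsFactors = FALSE
)

# Treat both NA and empty strings as missing labels.
is_missing <- is.na(md_toy$homeTeam) | md_toy$homeTeam == ""

sum(is_missing)                  # number of unlabelled matches
table(md_toy$year[is_missing])   # which seasons they fall in
```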

Now let’s check the consistency of the data between the different sets. The career games played figure correlates fairly well with the number of records in the database, except for players with a lot of games played (who debuted before 2001), and players who have 0 recorded career games.

ggplot(aes(games, careerGamesPlayed), data=psum) + geom_point()
## Warning: Removed 2 rows containing missing values (geom_point).
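The two groups of exceptions could be pulled out with simple filters, as in this sketch. The column names `games` and `careerGamesPlayed` come from the plot above; the rows are invented.

```r
# Toy stand-in for psum with made-up values.
psum_toy <- data.frame(
  playerId          = c("p1", "p2", "p3", "p4"),
  games             = c(120, 45, 200, 10),
  careerGamesPlayed = c(120, 45, 310, 0)
)

# Career total exceeds the records held: likely debuted before 2001.
veterans <- subset(psum_toy, careerGamesPlayed > games)

# Records exist but the career total is zero: likely a gap in the summary data.
zero_career <- subset(psum_toy, careerGamesPlayed == 0 & games > 0)

veterans
zero_career
```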

When we look at the dates of birth, draft and debut, the data is clearly incomplete. The date of birth would be expected to be fairly consistent by year, but clearly many older players don’t have DoB recorded. Additionally, draft and debut information only appears to be available for younger players.

ggplot(aes(dateOfBirth), data=psum) + geom_histogram(binwidth=365*24*3600)

ggplot(aes(debutYear, draftYear), data=psum) + geom_jitter()
## Warning: Removed 1014 rows containing missing values (geom_point).

EDA

Univariate Plots Section

First let’s investigate the nature of some key variables, starting with the number of recorded games per player. Results appear to be well formed, with no obvious outliers. Note that 25% of players play 14 games or fewer in this period, and the median number of games is only 44.5: very few players actually manage to build a career in the game. The maximum number of games recorded is 330, which fits well with teams playing around 170 home games over the period (along with around 170 away games).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   14.00   44.50   69.86  111.20  330.00
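The table above has the shape of base R `summary()` output. The call that produced it is not shown, so the following is only a sketch of that kind of call on made-up game counts.

```r
# Made-up per-player game counts, not the real psum$games column.
games_toy <- c(1, 3, 14, 20, 44, 45, 90, 111, 150, 330)

summary(games_toy)          # min, quartiles, median, mean and max in one call
quantile(games_toy, 0.25)   # first quartile on its own
```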

Next, look at the distribution of height and weight for the players. The height distribution looks almost normal, although there are twice as many players at 180cm as at 179cm, so clearly some rounding is happening there. There is a similar spike at 200cm, but curiously not at 190cm. The weight distribution is similar, but with slightly longer tails, and shows the same rounding anomalies at 80kg and 100kg. There are no big outliers here.
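A quick way to check for this kind of rounding is to tabulate the exact values and compare neighbouring counts. The heights below are invented, with deliberate heaping at 180cm.

```r
# Made-up heights with deliberate heaping at 180 cm.
height_toy <- c(178, 179, 179, 180, 180, 180, 180, 181, 181, 182)

counts <- table(height_toy)
counts

# Count at 180 cm relative to 179 cm; a ratio well above 1 hints at rounding.
heap_ratio <- as.numeric(counts["180"]) / as.numeric(counts["179"])
heap_ratio
```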

Now visualise the distributions of different game statistics, for instance kicks and handballs. Both appear to be similarly distributed, with a shape resembling a log-normal or Weibull distribution. Most players appear to have more kicks than handballs.

p1 <- qplot(x=kicks, data=pms, binwidth=1)
p2 <- qplot(x=handballs, data=pms, binwidth=1)
grid.arrange(p1, p2, nrow=2)

Most players score 0 goals per game, which is unsurprising given an average side will score around 15 goals in a game, shared amongst 22 players. Transforming the y-axis by log10 reveals the long tail of the distribution.

p1 <- qplot(x=goals, data=pms, binwidth=1) + scale_y_log10(limits=c(1,3e5))
p2 <- qplot(x=behinds, data=pms, binwidth=1) + scale_y_log10(limits=c(1,3e5))
grid.arrange(p1, p2, nrow=2)
## Warning: Stacking not well defined when ymin != 0
## Warning: Stacking not well defined when ymin != 0

Hitouts are a unique part of the game: when the ball is bounced or thrown up by the umpire, a key-position player known as the ruck attempts to tap the ball to a player on their side, and a successful tap is recorded as a hitout. Because only a handful of tall players play in the ruck, a large number of players have no hitouts. Beyond that the tail is very long, but it drops off somewhat rapidly above 20 hitouts.

qplot(x=hitouts, data=pms, binwidth=1) + scale_y_log10(limits=c(1,3e5))
## Warning: Stacking not well defined when ymin != 0

Finally we can look at how the aggregate statistics compare to the individual game statistics. As expected, the per-player averages cluster much more tightly around the mean than the single-game values do, as the law of large numbers predicts when many games are averaged.

p1 <- qplot(x=kicks, data=agg20, binwidth=1) + xlim(0,30)
p2 <- qplot(x=kicks, data=pms, binwidth=1) + xlim(0,30)
grid.arrange(p1, p2, nrow=2)
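The narrowing can be reproduced with a small simulation. This is only a sketch: it assumes every player draws kicks from the same Poisson distribution, which the real data certainly does not.

```r
set.seed(1)  # fixed seed so the sketch is repeatable

n_players <- 200
n_games   <- 50

# Simulated per-game kicks: one row per player, one column per game.
# Assumption: all players share the same Poisson(8) distribution.
kicks <- matrix(rpois(n_players * n_games, lambda = 8), nrow = n_players)

per_game_sd   <- sd(as.vector(kicks))   # spread of individual game values
per_player_sd <- sd(rowMeans(kicks))    # spread of per-player averages

c(per_game = per_game_sd, per_player = per_player_sd)
```

In the real data the per-player averages stay more spread out than this, because players genuinely differ in role and ability.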

Univariate Analysis

What is the structure of your dataset?

There are three data sets that have been collected from the website:
- Match Data (md), a list of every match played, with key results about the match.
- Player Match Stats (pms), a record of each player’s performance in each match, with vital statistics including kicks, handballs, tackles, marks, hitouts, goals and behinds.
- Player Summary Data (psum), key biographical data for each player, including height, weight, date of birth, draft year and age. This dataset is incomplete for some variables.

What is/are the main feature(s) of interest in your dataset?

Statistics for players across different games.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Players’ simple physical characteristics (weight, height) and their number of games of experience, to see if these influence how often they perform certain roles.

Did you create any new variables from existing variables in the dataset?

A simple numerical indicator for each game was built as the composite of the year and round number, to allow for easier time series plotting without giant gaps in the off-season.
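As a worked example of the indicator, the formula used for `pms$gameFloat` is wrapped here in a function. The name `game_float` is made up for illustration, and the sketch assumes round numbers never exceed 30.

```r
# Same formula as pms$gameFloat: the year plus a fraction for the round.
game_float <- function(year, round) {
  year + (round - 1) / 30
}

game_float(2010, 1)    # first round sits exactly on the year: 2010
game_float(2010, 16)   # mid-season: 2010.5
game_float(2010, 23)   # a late round still falls before 2011
```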

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

A new aggregated dataset was constructed to reduce the volume of data, and to better gauge players’ average performances. Additionally, the number of positions was reduced from 24 to 9 to simplify interpretation. Otherwise the data was well formatted during the scraping and data wrangling in Python.

Bivariate Plots Section

To kickstart the investigation into relationships between variables, multiple scatterplot matrices were built between various combinations of variables. Some variables that are expected to correlate well do, e.g. goals and behinds (0.93), height and weight (0.85) and marks inside 50m with goals (0.93). However, the more interesting combinations are:
- Tackles and clearances (0.65), probably because players who play ‘on the ball’ are heavily involved in the closer scuffles in the game, and hence rack up both stats.
- Free kicks for and against are correlated at 0.47, which suggests that the free kicks paid for and against most players roughly balance out.
- There is a moderate negative correlation between height and kicks (-0.41), although perhaps this is more related to the role that players have on the ground than to their skill at kicking.

(Note: high-res versions of these plots are included in the repo)
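The code behind the scatterplot matrices is not shown, so here is a minimal sketch of the numeric side using base `cor()`. The data is simulated and only borrows column names from `agg`; the coefficients it produces are not the ones quoted above.

```r
set.seed(42)  # repeatable made-up data

n <- 300
height <- rnorm(n, mean = 188, sd = 7)
toy <- data.frame(
  heightInCm = height,
  weightInKg = 0.9 * height - 82 + rnorm(n, sd = 4),   # heavier when taller
  kicks      = 30 - 0.12 * height + rnorm(n, sd = 2)   # fewer kicks when taller
)

# Pairwise Pearson correlations, rounded for reading.
cor_mat <- cor(toy)
round(cor_mat, 2)
```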

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

When we look closer at the relationship between free kicks for and against individual players, we notice that although it balances out for most players, there are still a handful of ‘dirty’ players who concede 2-4 times as many free kicks as they receive.

p1 <- qplot(x=freesFor, y=freesAgainst, data=agg20)
p2 <- qplot(x=freesAgainst/freesFor, data=agg20)
grid.arrange(p1, p2, ncol=2)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
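To list the players behind that tail rather than just plot them, the ratio can be stored as a column and filtered. This is a sketch on invented rows; `freeRatio` and the threshold of 2 are choices made for this example.

```r
# Toy stand-in for agg20 with made-up per-game averages.
agg_toy <- data.frame(
  playerId     = c("p1", "p2", "p3", "p4"),
  freesFor     = c(1.0, 0.8, 0.5, 0.4),
  freesAgainst = c(1.1, 0.7, 1.5, 1.6)
)

# Frees conceded per free received (Inf if a player never receives one).
agg_toy$freeRatio <- agg_toy$freesAgainst / agg_toy$freesFor

# Players conceding at least twice as many frees as they receive.
flagged <- subset(agg_toy, freeRatio >= 2)
flagged
```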

Finally, hitouts were moderately strongly correlated with height in the previous scatter matrices. However, if we restrict the data to players who are frequently involved in hitouts, i.e. who play in that role, by filtering for average hitouts greater than 5, we find that the correlation is weaker: the coefficient drops from 0.594 to 0.439. Clearly height is only one component of performing well in this role; skill and jump height probably also play an important part.

hitout_df <- subset(agg20, hitouts>5)
qplot(x=heightInCm, y=hitouts, data=hitout_df, geom='jitter')

cor.test(hitout_df$heightInCm, hitout_df$hitouts)
## 
##  Pearson's product-moment correlation
## 
## data:  hitout_df$heightInCm and hitout_df$hitouts
## t = 5.0795, df = 108, p-value = 1.594e-06
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2744511 0.5787792
## sample estimates:
##       cor 
## 0.4391265
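Part of the drop may be a statistical side effect rather than something about rucks: filtering on one variable narrows its range, which tends to lower a correlation by itself. The simulation below illustrates this on standardised, entirely made-up data. The variable names only echo the real columns.

```r
set.seed(7)  # repeatable made-up data

n <- 5000
height  <- rnorm(n)
hitouts <- 0.6 * height + rnorm(n, sd = 0.8)   # built to correlate at about 0.6

r_all <- cor(height, hitouts)

# Keep only the upper half of hitouts, mimicking a "hitouts > 5" style filter.
keep  <- hitouts > median(hitouts)
r_sub <- cor(height[keep], hitouts[keep])

c(all = r_all, filtered = r_sub)
```

The filtered coefficient comes out lower even though the underlying relationship is the same for every simulated player.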

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Described above.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Described above.

What was the strongest relationship you found?

Goals and behinds, marks inside 50m and goals. These relationships are unsurprising, given a mark inside 50m leads to a shot on goal, and behinds happen when you miss the goal.

Multivariate Plots Section

Let’s take a look at the accuracy of players in kicking goals. In AFL, there are four posts to kick at: a kick between the two inner goal posts scores 6 points, while a kick that passes between a goal post and one of the outer behind posts scores only 1 point. The accuracy of a player kicking for goal is the number of goals divided by shots on goal (i.e. goals + behinds). The accuracy has been evaluated for all players with more than 50 shots on goal, and both goals per game and accuracy have been further broken down by position. Surprisingly, forwards seem to be only marginally more accurate than non-forwards: median 60.2% vs 57.7%. However, the long upper tail of the box plot suggests that the best forwards are very accurate, at over 75%.

goal_temp <- select(pms, goals, behinds, playerId) %>%
  group_by(playerId) %>%
  summarise(shot_count = sum(goals)+sum(behinds),
            goals = mean(goals),
            behinds = mean(behinds),
            accuracy = mean(goals)/(mean(goals)+mean(behinds)),
            games = n()) %>%
  ungroup() %>%
  merge(select(psum, playerId, position)) %>%
  subset(shot_count>50)

p1 <- ggplot(aes(x=goals, y=behinds), data=goal_temp) + geom_point() + geom_smooth()
p2 <- qplot(x=accuracy, data=goal_temp)
p3 <- qplot(x=position, y=goals, data=goal_temp, geom='boxplot')
p4 <- qplot(x=position, y=accuracy, data=goal_temp, geom='boxplot')
group_by(goal_temp, position) %>% summarise(median(accuracy))
## Source: local data frame [8 x 2]
## 
##       position median(accuracy)
##         (fctr)            (dbl)
## 1       Centre        0.5679012
## 2    Half Back        0.5702479
## 3 Half Forward        0.5902778
## 4      Forward        0.6024784
## 5         Back        0.5842697
## 6        Other        0.5733333
## 7        Rover        0.5576923
## 8         Ruck        0.6055046
with(subset(goal_temp, position != "Forward"), median(accuracy))
## [1] 0.5773196
grid.arrange(p1, p2, p3, p4, ncol=2)
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
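As a worked example of the accuracy measure, here is a small helper. The name `shot_accuracy` is hypothetical, and returning `NA` for players with no shots is a choice made for this sketch.

```r
# Hypothetical helper implementing accuracy = goals / (goals + behinds).
shot_accuracy <- function(goals, behinds) {
  shots <- goals + behinds
  ifelse(shots > 0, goals / shots, NA_real_)   # undefined when no shots were taken
}

shot_accuracy(60, 40)   # 60 goals from 100 shots: 0.6
shot_accuracy(0, 0)     # no shots on goal: NA
```

Dividing per-game means gives the same ratio as dividing career totals, since both means share the same number of games.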

When we colour the scatter plot by position, we see that forwards and half-forwards dominate the high-scoring section of the plot, as expected.

ggplot(aes(x=goals, y=behinds), data=goal_temp) + geom_point(aes(color=position))

Finally, let’s look at the change in the game over time. By plotting certain statistics over time, both by team and as an aggregate, we can observe certain patterns. For instance, the number of kicks per player per match has increased only slightly over time, although there have been periods where Collingwood (COLL) have played a high-kicking game (2006-2011). However, the number of handballs increased significantly over the course of the decade, a trend started by West Coast (WCE) in 2006-7 and practised heavily by Geelong (GEEL) through 2007-2010. The trend has since reversed and stabilised in recent years.

my_df <- select(pms, year, gameFloat, teamAbbr, handballs, kicks, tackles, clearances_totalClearances) %>%
  group_by(year, gameFloat, teamAbbr) %>%
  summarise_each(funs(mean)) %>%
  ungroup()
p1 <- ggplot(aes(x=year, y=kicks), data=my_df) +
  geom_line(aes(color=teamAbbr), stat="summary", fun.y="mean") +
  geom_smooth(aes(x=gameFloat-0.5), color='black')
p2 <- ggplot(aes(x=year, y=handballs), data=my_df) + 
  geom_line(aes(color=teamAbbr), stat="summary", fun.y="mean") +
  geom_smooth(aes(x=gameFloat-0.5), color='black')
grid.arrange(p1, p2, ncol=1)
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
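The same team-by-season averaging can be done in base R with `aggregate()`, which makes the numbers behind the lines easy to inspect. The rows below are invented and only the column names match `pms`.

```r
# Toy stand-in for pms: one row per player per match, values made up.
pms_toy <- data.frame(
  year      = c(2006, 2006, 2006, 2006, 2007, 2007, 2007, 2007),
  teamAbbr  = c("WCE", "WCE", "GEEL", "GEEL", "WCE", "WCE", "GEEL", "GEEL"),
  handballs = c(8, 10, 6, 8, 9, 11, 10, 12)
)

# Mean handballs per player-match for each team and season.
by_team <- aggregate(handballs ~ year + teamAbbr, data = pms_toy, FUN = mean)
by_team

# League-wide mean per season, ignoring team.
by_year <- aggregate(handballs ~ year, data = pms_toy, FUN = mean)
by_year
```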

Throughout the decade there has been a lot of talk about ‘congested’ football, in which many players stay close to the ball and contest it with great intensity. This can be seen in the tackle count, which rose steadily from 2001 to 2011. Sydney (SYD) have been known throughout this period for playing tight, congested football, and they frequently led the league in tackle count. Sydney also led clearances (getting the ball out of a congested zone) through much of the decade, especially from 2005 to 2013, when they were typically well above the league average.

p1 <- ggplot(aes(x=year, y=tackles), data=my_df) +
  geom_line(aes(color=teamAbbr), stat="summary", fun.y="mean") + 
  geom_smooth(aes(x=gameFloat-0.5), color='black')
p2 <- ggplot(aes(x=year, y=clearances_totalClearances), data=my_df) + 
  geom_line(aes(color=teamAbbr), stat="summary", fun.y="mean") + 
  geom_smooth(aes(x=gameFloat-0.5), color='black')
grid.arrange(p1, p2, ncol=1)
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Discussed above.

Were there any interesting or surprising interactions between features?

Discussed above.

Final Plots and Summary

Todo

Plot One

Description One

Plot Two

Description Two

Plot Three

Description Three


Reflection